
Add persistent program cache for Program.compile #1912

Open
cpcloud wants to merge 3 commits into NVIDIA:main from cpcloud:persistent-program-cache-178

Conversation

@cpcloud
Contributor

@cpcloud cpcloud commented Apr 14, 2026

Summary

Adds a persistent on-disk cache for cuda.core.Program.compile outputs. The high-level integration is one keyword on Program.compile:

from cuda.core import Program, ProgramOptions
from cuda.core.utils import FileStreamProgramCache

source = 'extern "C" __global__ void k(int *a){ *a = 1; }'
options = ProgramOptions(arch="sm_80")

with FileStreamProgramCache() as cache:  # default: $XDG_CACHE_HOME/cuda-python/program-cache
    obj = Program(source, "c++", options=options).compile("cubin", cache=cache)
    obj.get_kernel("k")

A second invocation with the same inputs short-circuits the entire NVRTC compile: a cache.get(key) (one stat + one read) and an ObjectCode._init from the bytes, with no compile call made at all. This is the fast path the cache exists to provide:

# Fresh process / second run -- same source, same options.
with FileStreamProgramCache() as cache:
    obj = Program(source, "c++", options=options).compile("cubin", cache=cache)
    # ~10us round-trip on a warm page cache, vs hundreds of ms to seconds
    # for an actual NVRTC invocation.

Public API

  • Program.compile(target_type, *, cache=...) — convenience wrapper. Derives the key, returns a fresh ObjectCode on hit, stores the compile output on miss.
  • cuda.core.utils.ProgramCacheResource — abstract bytes-in / bytes-out interface for custom backends. Provides get, update (Mapping or pairs), clear, and the mapping mutators (__getitem__/__setitem__/__delitem__/__len__). __contains__ is intentionally omitted: cache.get(key) is the recommended idiom because the two-call if key in cache: cache[key] pattern is racy across processes.
  • cuda.core.utils.InMemoryProgramCache — single-process LRU on OrderedDict, threading.RLock, size-only cap. For "compile once, look up many" workflows that don't need persistence.
  • cuda.core.utils.FileStreamProgramCache — directory of atomic per-entry files. Safe across processes via os.replace + Windows sharing-violation retries on os.replace / read / unlink.
  • cuda.core.utils.make_program_cache_key — escape hatch when the compile inputs require an extra_digest (include_path, pre_include, pch, use_pch, pch_dir, NVVM use_libdevice=True, NVRTC options.name with a directory component). Program.compile(cache=...) rejects those compiles with a ValueError pointing here.
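The reasoning behind omitting __contains__ can be shown with a tiny stand-in (a plain dict here, purely illustrative; Program.compile(cache=...) is duck-typed, so any mapping-like object with .get() has the right shape):

```python
# Stand-in backend: a plain dict in place of a real cache backend.
cache = {}

def lookup_or_compile(key, compile_fn):
    # One get() instead of the racy `if key in cache: cache[key]` pair --
    # another process could evict the entry between those two steps.
    data = cache.get(key)
    if data is None:
        data = compile_fn()
        cache[key] = data
    return data

first = lookup_or_compile(b"key-1", lambda: b"\x7fELF-cubin-bytes")
second = lookup_or_compile(b"key-1", lambda: b"never called")
```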

On-disk format

Each entry is the raw compiled binary verbatim — cubin / PTX / LTO-IR — with no pickle, JSON, length prefix, or framing of any kind. Cache files are directly consumable by external NVIDIA tools (cuobjdump, nvdisasm, cuda-gdb).

ObjectCode.symbol_mapping from name_expressions is not preserved across a cache round-trip; the wrapper rejects Program.compile(name_expressions=..., cache=...) outright so the first-call-works/second-call-breaks footgun can't surface. Callers that need get_kernel(name_expression) should compile without cache=.

FileStreamProgramCache

  • Atomic writes: stage to tmp/, fsync, os.replace into entries/<2char>/<hash>. Concurrent readers never observe partial writes. Windows os.replace retries on ERROR_ACCESS_DENIED / ERROR_SHARING_VIOLATION / ERROR_LOCK_VIOLATION (winerrors 5/32/33) within a bounded backoff (~185 ms); after the budget, the write is dropped and the next call recompiles. The same retry covers reads and path.unlink so eviction doesn't crash the writer that triggered it on win-64.
  • Sharing-violation predicate: _is_windows_sharing_violation(exc) filters EACCES only when winerror is absent — non-sharing winerrors are real config errors and propagate. Off-Windows PermissionError always propagates.
  • Transparent input forms: cache[key] = value (and cache.update({key: value, ...})) accept raw bytes, bytearray, memoryview, or any ObjectCode (path-backed too — the file is read at write time so the cached entry is the binary content, not a path that could move). Reads return the same bytes that went in.
  • Size-only bound: max_size_bytes is the only knob — no element-count cap. None means unbounded.
  • True LRU via atime: every successful read calls os.utime (fd-based on Linux/macOS via os.supports_fd, path-based on Windows) to bump st_atime regardless of mount options or NtfsDisableLastAccessUpdate. Eviction sorts by oldest st_atime first. The atime touch is stat-guarded so a racing rewriter's freshly-replaced file never has its mtime rolled back.
  • Stat-guarded prunes: clear(), _enforce_size_cap(), and the atime touch all snapshot (ino, size, mtime_ns) per entry and refuse to unlink / overwrite stamps if a writer replaced the file mid-operation.
  • Cache key derivation (make_program_cache_key): a backend-strategy pattern with one class per code_type (_NvrtcBackend / _LinkerBackend / _NvvmBackend). Each owns its own validate / encode_code / option_fingerprint / encode_name_expressions / hash_version_probe / hash_extra_payload. The orchestrator validates code_type/target_type, dispatches to the right backend, and assembles the digest in fixed order. Adding a new backend is one new class, not a five-place edit.
  • NVRTC options.name with a directory component: rejected without extra_digest because NVRTC resolves quoted #include directives relative to that directory — neighbour-header changes wouldn't invalidate the cache otherwise.
  • PTX-loadability warning on cache hit: when the active driver can't load freshly-generated PTX, the wrapper emits the same RuntimeWarning the uncached path emits — loadability depends on the driver, not on whether the bytes were freshly compiled.
  • Default cache directory: when path is omitted, resolves via platformdirs.user_cache_path("cuda-python", appauthor=False, opinion=False) / "program-cache":
    • Linux/BSD: $XDG_CACHE_HOME/cuda-python/program-cache (default ~/.cache/cuda-python/program-cache)
    • macOS: ~/Library/Caches/cuda-python/program-cache
    • Windows: %LOCALAPPDATA%\cuda-python\program-cache
  • tmp/ self-heal: if something deletes tmp/ after the cache is opened, the next write recreates it rather than crashing with FileNotFoundError.
  • Crashed-writer cleanup: stale temp files older than 1 hour are swept on open and on size-cap enforcement.
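The stage/fsync/replace sequence from the first bullet can be sketched in a few lines (a simplified illustration, not the PR's code; the Windows retry logic, hashing layout, and eviction layers are omitted):

```python
import os
import tempfile
from pathlib import Path

def atomic_write(entry_path: Path, data: bytes) -> None:
    # Stage in a sibling tmp/ dir on the same filesystem so os.replace is atomic.
    tmp_dir = entry_path.parent / "tmp"
    tmp_dir.mkdir(parents=True, exist_ok=True)  # self-heals if tmp/ was wiped
    fd, tmp_name = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # durable before it becomes visible
        # Atomic promotion: concurrent readers see the old bytes or the
        # new bytes, never a torn file.
        os.replace(tmp_name, entry_path)
    except BaseException:
        os.unlink(tmp_name)
        raise
```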

Test plan

  • tests/test_program_cache.py — abstract-class contract, update accepts mapping or pairs, transparent input-form equivalence (bytes / bytearray / memoryview / bytes-backed ObjectCode / path-backed ObjectCode all round-trip to the same on-disk bytes), make_program_cache_key semantics (deterministic, supported-target matrix mirrors Program.compile, backend probe failures fail closed but stable, env-version changes don't perturb the key on the wrong backends, options-fingerprint canonicalization for the linker path, side-effect / external-content / NVRTC options.name-dir-component guards, schema version mixing), filestream CRUD, atomic-write race coverage, stat-guarded prune / atime-touch / clear / size-cap, atime LRU promotes recently-read, default-dir uses platformdirs, _is_windows_sharing_violation predicate's truth table including the regression case (non-sharing winerror plus EACCES propagates), tmp/ recreation after external wipe.
  • tests/test_program_cache_multiprocess.py — concurrent writers same key, distinct keys, reader-vs-writer torn-file safety, size-cap eviction race (rewriter vs. churner) under stat-guarded eviction.
  • tests/test_program_compile_cache.py — Program.compile(cache=...) miss/hit/error paths against a recording stub, name_expressions rejection, extra_digest-required / side-effect / NVRTC options.name-dir-component rejection, PTX loadability warning on cache hit (positive + negative), real-NVRTC end-to-end roundtrip across reopen.

@cpcloud cpcloud added this to the cuda.core v1.0.0 milestone Apr 14, 2026
@cpcloud cpcloud added P0 High priority - Must do! feature New feature or request cuda.core Everything related to the cuda.core module labels Apr 14, 2026
@cpcloud cpcloud self-assigned this Apr 14, 2026
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from de57bd8 to ac38a68 Compare April 14, 2026 22:15

@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch 23 times, most recently from f1ae40e to b27ed2c Compare April 19, 2026 13:28
Comment thread cuda_core/cuda/core/utils/__init__.py Outdated
@leofang
Member

leofang commented Apr 22, 2026

Thanks, Phillip! I have this PR in my review backlog 🙏

The most important question: Are these cache implementations multithreading/multiprocessing safe? This is the key challenge that real-world apps will stress test. In CuPy, our on-disk cache has been stress-tested in DOE supercomputers.

@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from 3a32786 to cad93d0 Compare April 22, 2026 12:04
@cpcloud
Contributor Author

cpcloud commented Apr 22, 2026

Addressed in ff886d3585 (fixes) and cad93d0 (refactor + star-import note).

High -- source-directory include. make_program_cache_key() now refuses to build an NVRTC key when options.name contains a directory separator and neither extra_digest nor no_source_include=True is set. Scoping the guard to names that actually introduce a new search directory (/abs/kernel.cu, rel/kernel.cu, C:\src\kernel.cu) keeps bare labels like "default_program" or "kernel-a" -- which fall back to CWD, the same search root every NVRTC compile sees -- accepted unchanged. Tests cover POSIX and Windows separators, the extra_digest and no_source_include=True accept paths, and confirm the guard is NVRTC-only (PTX and NVVM unaffected).

Medium -- over-eviction race. FileStreamProgramCache._enforce_size_cap() now decrements total whether it unlinks the candidate itself or a concurrent pruner already removed the file. The FileNotFoundError is still suppressed, but the accounting now matches reality, so the loop stops as soon as the cap is met. Added a test that monkeypatches Path.unlink to simulate a concurrent deleter winning exactly once, then verifies the freshly-committed entry survives.

Low -- star-import. Added a note in cuda_core/cuda/core/utils/__init__.py that the laziness guarantee is for explicit imports only -- from cuda.core.utils import * walks __all__ and therefore resolves every lazy attribute. Star-imports are discouraged anyway, so treat that as expected.

@cpcloud
Contributor Author

cpcloud commented Apr 22, 2026

@leofang -- yes, all three backends are designed and tested for concurrent access, with different scopes:

InMemoryProgramCache -- thread-safe, not process-safe. Dict-backed (OrderedDict) cache that lives only in the owning process. threading.RLock serialises every method (__getitem__, __setitem__, __contains__, __delitem__, __len__, clear, and the internal eviction pass) so threads can share one cache object without external locking. It is not process-safe by design: each process has its own independent cache, there is no shared state or IPC; for multi-process sharing use SQLiteProgramCache or FileStreamProgramCache. Note that entries are stored by reference, not copied -- a thread that mutates a returned ObjectCode affects the cached entry, so callers must treat reads as read-only. Stressed by a threaded test with 4 writers + 4 readers x 200 ops against a cache with max_entries set, verifying the cache stays consistent and never exceeds the cap.
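The OrderedDict-plus-RLock design described above can be sketched minimally (illustrative only; the real InMemoryProgramCache implements the full mapping surface and input coercion):

```python
import threading
from collections import OrderedDict

class TinyLRUCache:
    # Minimal sketch of the OrderedDict + RLock design: every method runs
    # under one re-entrant lock, with a size-only byte cap.
    def __init__(self, max_size_bytes: int) -> None:
        self._max = max_size_bytes
        self._total = 0
        self._entries: OrderedDict = OrderedDict()
        self._lock = threading.RLock()

    def get(self, key: bytes):
        with self._lock:
            data = self._entries.get(key)
            if data is not None:
                self._entries.move_to_end(key)  # a read promotes to most-recent
            return data

    def __setitem__(self, key: bytes, data: bytes) -> None:
        with self._lock:
            old = self._entries.pop(key, None)
            if old is not None:
                self._total -= len(old)
            self._entries[key] = data
            self._total += len(data)
            while self._total > self._max and len(self._entries) > 1:
                _, evicted = self._entries.popitem(last=False)  # coldest first
                self._total -= len(evicted)
```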

SQLiteProgramCache -- thread-safe; multi-process best-effort. check_same_thread=False on the connection plus a threading.RLock serialises every connection-touching method, so threads cannot interleave a read/update or a write/VACUUM pair. WAL + autocommit on open. Stressed by a 4 writers + 4 readers x 200 ops test in test_program_cache.py. Sharing the sqlite file across processes does work (sqlite3 WAL serialises writes at the file level and our size-cap/VACUUM passes run under the same WAL discipline), but the threading.RLock does not cross process boundaries and the VACUUM pass skips under active readers, so the on-disk file can temporarily grow above max_size_bytes until readers release. For workloads with many concurrent processes, FileStreamProgramCache is the better fit.

FileStreamProgramCache -- thread-safe and process-safe. Every write lands on a per-write temp file and is promoted via os.replace, so a reader/writer race either sees the old entry or the new one -- never a half-written file. Reader pruning, clear(), and _enforce_size_cap are all stat-guarded: before unlinking, the code re-stats the candidate and refuses if (ino, size, mtime_ns) differs from the snapshot, so a concurrent writer's os.replace is preserved. Stale temp files are swept on open. On Windows, os.replace can surface ERROR_SHARING_VIOLATION (32) / ERROR_LOCK_VIOLATION (33) against a reader briefly holding the handle; the code retries with bounded backoff (~185ms total) before treating it as a non-fatal cache miss -- all other PermissionErrors and POSIX failures propagate.

Cross-process coverage in test_program_cache_multiprocess.py:

  • concurrent writers producing overlapping keys
  • a writer/reader race exercising the stat-guarded prune path
  • clear/eviction race injection via generator cleanup (the cleanup code after the last yield runs at StopIteration, which is exactly between _enforce_size_cap's scan and its eviction loop)
  • Windows PermissionError narrowing (winerror 32/33 swallow + retry, all others propagate)

One concurrency bug this review shook out (over-eviction after a suppressed FileNotFoundError in _enforce_size_cap) is fixed with its own test. If you see a DOE-style pattern from CuPy's cache that we don't cover yet, happy to add a test that reproduces it -- mapping that stress-testing onto this backend would be useful.

@cpcloud cpcloud changed the title Add PersistentProgramCache (sqlite + filestream backends) Add program caches (in-memory, sqlite, filestream) Apr 22, 2026
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from 457cab7 to cfddd08 Compare April 23, 2026 12:35
@cpcloud cpcloud requested a review from rwgk April 23, 2026 15:00
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from cfddd08 to fce123f Compare April 24, 2026 12:41
@rwgk
Contributor

rwgk commented Apr 27, 2026

FWIW, I briefly explored "safe pickle" and "signed pickle blobs" in this chat:

The conclusion there is:

  • Do not pickle compiled-kernel cache entries.
  • Use JSON / simple binary files for metadata and artifacts.

@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch 4 times, most recently from 7d1cb23 to 86dab90 Compare April 29, 2026 13:41
@cpcloud cpcloud changed the title Add program caches (in-memory, sqlite, filestream) Add persistent program cache for Program.compile Apr 29, 2026
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from 86dab90 to a60f1c6 Compare April 29, 2026 14:08
Member

@leofang leofang left a comment

Thanks, Phillip! Sorry for the long wait. Sending out the first wave of my review. Will continue asap.

Note: It would be nice if we could break up the two largest files (cuda_core/cuda/core/utils/_program_cache.py and cuda_core/tests/test_program_cache.py, each 1-2k lines) into smaller logical units.

Comment thread cuda_core/pyproject.toml Outdated
Comment thread cuda_core/pixi.toml Outdated
Comment thread cuda_core/cuda/core/utils/_program_cache.py Outdated
Comment thread cuda_core/pixi.lock
Comment on lines +135 to +153
# ``name_expressions`` is incompatible with the cache: NVRTC
# populates ``ObjectCode.symbol_mapping`` from name-expression
# mangling at compile time, and that mapping isn't carried in
# the binary bytes the cache stores. Without this guard the
# first call (cache miss) would return an ObjectCode with
# symbol_mapping populated, while every subsequent call (hit)
# would return one without -- silently breaking later
# ``get_kernel(name_expression)`` lookups that work on the
# uncached path. Fail loud here instead.
if name_expressions:
    raise ValueError(
        "Program.compile(cache=...) does not support name_expressions: "
        "ObjectCode.symbol_mapping is populated by NVRTC at compile "
        "time and is not preserved across a cache round-trip, so cache "
        "hits would silently break get_kernel(name_expression) lookups "
        "that the uncached path supports. Compile without cache= when "
        "name_expressions are needed, or look up mangled symbols by "
        "hand from the cached ObjectCode."
    )
Member

Note to self: I need to address this after 1.0 is out, xref: cupy/cupy#9801

Comment on lines +1056 to +1066
def __getitem__(self, key: object) -> bytes:
    k = _as_key_bytes(key)
    with self._lock:
        try:
            data, _size = self._entries[k]
        except KeyError:
            raise KeyError(key) from None
        # Touch LRU: a real read promotes the entry to "most recent"
        # so eviction prefers genuinely cold entries.
        self._entries.move_to_end(k)
        return data
Member

Q: What would be our recommended way of using InMemoryProgramCache in a multi-GPU env? Wondering about this because we usually have each GPU driven by a thread, and if the intended use case is a global cache object (which makes sense on a homogeneous system like DGX) this would cause serialization.

In CuPy internally there is a per-device cache so this issue is avoided.

@cpcloud
Contributor Author

cpcloud commented Apr 30, 2026

@leofang -- on the multi-GPU question. Two options worth weighing, both viable:

Background. make_program_cache_key already mixes arch into the digest, so on a homogeneous system (all sm_80, all sm_90) every device produces the same key for the same source -- sharing one cache amortises compiles across every GPU. The single-lock cost is real but bounded: the dict update + LRU bump runs in microseconds, and the compile itself runs outside the lock. So the contention only matters during compile-heavy startup with many threads, on heterogeneous-arch systems, or when profiling shows it.

Option A -- document the dict-of-caches pattern, no API change.

caches = {d.device_id: InMemoryProgramCache() for d in devices}
# per thread:
program.compile("cubin", cache=caches[Device().device_id])
  • Pros: zero API surface added; `Program.compile(cache=...)` is duck-typed today, so callers can also wrap the dict in a thin router themselves.
  • Cons: callers thread the dict through every compile site; easy to forget the device-id key.

Option B -- ship a `PerDeviceProgramCache` routing wrapper.

  • Pros: one-liner for callers (`cache = PerDeviceProgramCache()`); routing on `Device().device_id` lives inside the class.
  • Cons: couples the cache layer to `Device.current()` (every `getitem` reaches into the device API); more API surface to lock down before 1.0; harder to mock in tests.
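For reference, Option B's router could look roughly like this (a hypothetical class, not part of this PR; the device-id lookup is injected as a callable so the sketch runs without a GPU — in real use it would be `lambda: Device().device_id`):

```python
class PerDeviceProgramCache:
    # Hypothetical Option B wrapper: one backend cache per device id.
    def __init__(self, cache_factory, current_device_id):
        self._factory = cache_factory      # e.g. InMemoryProgramCache
        self._current = current_device_id  # callable returning the active device id
        self._caches = {}

    def _cache(self):
        dev = self._current()
        if dev not in self._caches:
            self._caches[dev] = self._factory()
        return self._caches[dev]

    def get(self, key):
        return self._cache().get(key)

    def __setitem__(self, key, value):
        self._cache()[key] = value
```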

I lean A for the first cut: the common case (homogeneous DGX, single SKU) is correctly served by one shared cache and benefits from cross-device amortisation, the heterogeneous-arch case is a 3-line dict away, and B can ship post-1.0 if real workloads make it pattern enough to deserve a class. Happy to pivot to B if you'd rather have it on the public surface from day one.

cpcloud added 2 commits April 30, 2026 04:45
Add a bytes-in / bytes-out cache abstraction and two backends for
caching compiled CUDA programs across process boundaries.

* ``ProgramCacheResource`` -- abstract base. Concrete backends store
  raw binary bytes keyed by ``bytes`` or ``str``; reads return the
  same payload. ``__setitem__`` accepts ``bytes``, ``bytearray``,
  ``memoryview``, or any :class:`~cuda.core.ObjectCode` (path-backed
  too -- the file is read at write time so the cached entry holds
  the binary content, not a path that could move). Provides default
  ``get``, ``update`` (mapping or pairs), ``close``, and context
  manager. ``__contains__`` is intentionally NOT abstract: the racy
  ``if key in cache; data = cache[key]`` idiom is steered toward
  ``cache.get(key)`` instead.

* ``InMemoryProgramCache`` -- single-process LRU on
  ``collections.OrderedDict`` with ``threading.RLock`` and a
  size-only cap. Reads promote via ``move_to_end``.

* ``FileStreamProgramCache`` -- directory of atomic per-entry files.
  Writes stage to ``tmp/`` then ``os.replace`` into
  ``entries/<2-char>/<blake2b-hex>``; concurrent readers never see a
  torn file. Each entry is the raw compiled binary (no pickle, no
  framing) so files are directly consumable by external NVIDIA
  tools (``cuobjdump``, ``nvdisasm``, ``cuda-gdb``). Eviction is
  true LRU via ``st_atime`` (the read path calls ``os.utime`` to
  bypass ``relatime`` / ``NtfsDisableLastAccessUpdate`` /
  ``noatime``). Stat-guarded prunes refuse to unlink entries
  another process replaced mid-eviction. ``tmp/`` is recreated on
  every write so an external wipe doesn't crash later writes.
  Default cache directory comes from
  ``platformdirs.user_cache_path("cuda-python", appauthor=False,
  opinion=False) / "program-cache"``.

* Windows sharing-violation handling -- ``os.replace``,
  ``path.stat() + read_bytes()``, and ``path.unlink`` all retry on
  winerror 5/32/33 with a bounded backoff (~185 ms). The
  ``_is_windows_sharing_violation`` predicate filters EACCES only
  when ``winerror`` is absent so non-sharing winerrors propagate as
  the real config errors they are. Off-Windows ``PermissionError``
  always propagates.
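The predicate's decision table can be sketched like so (a hypothetical reconstruction of the behavior described above, not the PR's code; the platform check is injected as a keyword so the sketch is testable anywhere):

```python
import errno

# Winerror codes the retry loop treats as transient sharing violations:
# 5 = ERROR_ACCESS_DENIED, 32 = ERROR_SHARING_VIOLATION, 33 = ERROR_LOCK_VIOLATION.
_SHARING_WINERRORS = {5, 32, 33}

def is_windows_sharing_violation(exc, *, is_windows: bool = True) -> bool:
    if not is_windows:
        return False  # off-Windows PermissionError always propagates
    winerror = getattr(exc, "winerror", None)
    if winerror is not None:
        return winerror in _SHARING_WINERRORS  # non-sharing winerrors propagate
    # No winerror attached: only a bare EACCES is treated as a sharing violation.
    return getattr(exc, "errno", None) == errno.EACCES
```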

* ``make_program_cache_key`` -- escape hatch for callers whose
  compile inputs require an ``extra_digest`` (header / PCH content
  fingerprints, NVVM libdevice). Builds a 32-byte blake2b digest
  via a backend-strategy pattern: a ``_KeyBackend`` ABC with
  per-code-type subclasses (``_NvrtcBackend``, ``_LinkerBackend``,
  ``_NvvmBackend``) owns each backend's validation, code coercion,
  option fingerprinting, name-expression handling, version probe,
  and extra-payload hashing. The orchestrator dispatches via
  ``_BACKENDS_BY_CODE_TYPE[code_type]`` and assembles the digest in
  fixed order. Backend gates match ``Program.compile``: rejects
  inputs the real compile would reject (side-effect options,
  external-content options without an ``extra_digest``,
  driver-linker-unsupported options, NVRTC ``options.name`` with a
  directory component). NVVM ``extra_sources`` is hashed in
  caller-provided order because NVVM module linking is
  order-dependent in the general case (overlapping symbols, weak
  definitions); canonicalising would silently change behavior for
  order-dependent inputs.
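The fixed-order digest assembly can be sketched like this (field set and order are illustrative, not the real schema; the actual key also versions itself and hashes name expressions per backend):

```python
import hashlib

def sketch_cache_key(code: bytes, code_type: str, target_type: str,
                     option_fingerprint: bytes, version_probe: bytes,
                     extra_digest: bytes = b"") -> bytes:
    # Assemble the fields in a fixed order; length-prefix each one so that
    # adjacent fields can never be confused for one another.
    h = hashlib.blake2b(digest_size=32)
    for field in (code_type.encode(), target_type.encode(), code,
                  option_fingerprint, version_probe, extra_digest):
        h.update(len(field).to_bytes(8, "little"))
        h.update(field)
    return h.digest()
```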

Adds ``platformdirs >=3.0`` to ``cuda_core/pyproject.toml`` and the
matching pixi manifests.

Tests cover the abstract contract, key-construction matrix
(deterministic, supported-target gates, backend-probe taint, gate
canonicalization, side-effect / external-content / dir-component
guards, schema version mixing), single-process CRUD and LRU,
atomic-write race coverage, atime LRU promotion, stat-guarded
prune / atime touch / clear / size-cap, default-dir resolution via
platformdirs, the ``_is_windows_sharing_violation`` predicate's
truth table including the regression case (non-sharing winerror
plus EACCES propagates), tmp-dir recreation after external wipe,
multiprocess concurrent writers / reader-vs-writer torn-file
safety / size-cap eviction race.

Adds a ``cache=`` keyword to :meth:`cuda.core.Program.compile` that
threads the persistent cache machinery into the high-level compile
path. With ``cache=None`` (the default) the call is byte-identical
to the un-cached path -- no key derivation, no extra import, no
behavior change.

When a cache is provided, the wrapper derives a key via
:func:`~cuda.core.utils.make_program_cache_key` from the program's
source, options, and target type; checks the cache; on hit,
returns a fresh
``ObjectCode._init(hit_bytes, target_type, name=self._options.name)``;
on miss, runs the underlying compile and stores
``cache[key] = compiled`` (the cache extracts ``bytes(obj.code)``).
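The miss/hit flow described above has a simple shape (a sketch only; the real wrapper derives ``key`` via ``make_program_cache_key`` and rebuilds an ``ObjectCode`` — both are injected here as callables so the flow is testable in isolation):

```python
def compile_with_cache(key, cache, compile_fn, rebuild_fn):
    data = cache.get(key)
    if data is not None:
        return rebuild_fn(data)  # hit: no compile runs at all
    obj = compile_fn()           # miss: run the real compile
    cache[key] = obj             # backend extracts the raw compiled bytes
    return obj
```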

Two compile-time guards close obvious footguns:

* ``name_expressions`` plus ``cache=`` raises ``ValueError``.
  NVRTC populates ``ObjectCode.symbol_mapping`` from name-expression
  mangling at compile time, and that mapping isn't carried in the
  binary the cache stores. Without this guard the first call (miss)
  would return an ObjectCode with mappings populated, while every
  subsequent call (hit) would return one without -- silently
  breaking later ``get_kernel(name_expression)`` lookups that work
  on the uncached path. Compiles that need name_expressions should
  run without ``cache=``, or look up mangled symbols by hand from
  the cached ``ObjectCode``.

* Inputs whose compilation effect isn't captured by the key
  (``include_path``, ``pre_include``, ``pch``, ``use_pch``,
  ``pch_dir``, NVVM ``use_libdevice=True``, NVRTC ``options.name``
  with a directory component, side-effect options like
  ``create_pch`` / ``time`` / ``fdevice_time_trace``) propagate the
  ``ValueError`` from ``make_program_cache_key`` -- those callers
  should use ``make_program_cache_key`` directly with an
  ``extra_digest`` covering the external content.

Cache hits also mirror the uncached path's NVRTC-PTX loadability
warning: when ``self._backend == "NVRTC"``, ``target_type ==
"ptx"``, and ``_can_load_generated_ptx()`` returns False, a
``RuntimeWarning`` is emitted before returning the cached bytes.
Loadability is a property of the active driver, not of how the
bytes were produced, so the warning applies equally to cached PTX.

Supporting refactors:

* Unify ``Program``'s source retention into a single ``_code``
  field (was split between ``_code`` for NVVM and a separate
  ``_source`` for c++/ptx). ``_code`` is now always bytes; the
  cache wrapper decodes back to ``str`` for c++/ptx before passing
  to ``make_program_cache_key`` (which only accepts bytes for NVVM).

* Move the actual compile call into a module-level
  ``_program_compile_uncached`` so tests can monkeypatch the seam
  without going through NVRTC. ``Program`` is a ``cdef class``, so
  its methods cannot be reassigned from Python -- the seam has to
  live outside the class.

* The unified ``_code`` field also exposed a pre-existing bug on
  the NVVM path: the C pointer was being recomputed from the
  caller's original ``code`` argument rather than from
  ``self._code``, which crashed for ``bytearray`` inputs that the
  field's bytes coercion handled cleanly. Fixed; regression test
  added in ``test_program.py``.

Tests in ``test_program_compile_cache.py`` cover both halves of the
contract: the wrapper-level miss/hit/error paths against a recording
stub (verifying it's duck-typed and doesn't require subclassing
``ProgramCacheResource``), the rejection paths (name_expressions,
extra_digest-required options, side-effect options, NVRTC
``options.name`` with a directory component), the PTX loadability
warning on cache hit (positive: warns when the driver can't load
the cached PTX; negative: stays quiet otherwise), and a real NVRTC
end-to-end roundtrip using ``FileStreamProgramCache`` across reopen
so the bytes match across processes.
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from a60f1c6 to d177450 Compare April 30, 2026 08:45
@cpcloud
Contributor Author

cpcloud commented Apr 30, 2026

Pushed d177450 addressing the first review wave:

platformdirs dropped (cuda_core/pyproject.toml, cuda_core/pixi.toml, cuda_core/cuda/core/utils/_program_cache/_file_stream.py). _default_cache_dir is now ~10 lines: `$XDG_CACHE_HOME or ~/.cache` on Linux, `%LOCALAPPDATA% or ~/AppData/Local` on Windows -- no macOS branch since CUDA doesn't support it. pixi.lock is reverted to upstream (the lock churn was solely from the now-removed dep). Tests in test_default_cache_dir_lives_under_user_cache_root parametrise both branches.
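The slimmed-down resolver could be sketched like this (an illustration of the branching described above, not the exact code):

```python
import os
from pathlib import Path

def default_cache_dir() -> Path:
    # Env var first, then the conventional per-user fallback.
    # No macOS branch, since CUDA does not support it.
    if os.name == "nt":
        root = os.environ.get("LOCALAPPDATA") or str(Path.home() / "AppData" / "Local")
    else:
        root = os.environ.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(root) / "cuda-python" / "program-cache"
```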

_program_cache.py split into a package (cuda_core/cuda/core/utils/_program_cache/{__init__.py,_abc.py,_keys.py,_in_memory.py,_file_stream.py}) -- 1700 lines → 5 focused submodules, each ≤ 700 lines. Public surface (`FileStreamProgramCache`, `InMemoryProgramCache`, `ProgramCacheResource`, `make_program_cache_key`) re-exported from the package, so external imports are unchanged. Tests that monkeypatch internals were updated to address the owning submodule directly (_program_cache._keys._linker_backend_and_version, _program_cache._file_stream._IS_WINDOWS, etc.) -- the package-level symbols are convenience aliases and don't intercept calls within submodules.

test_program_cache.py file split deferred. The 2179-line file is internally well-organised by section (key construction / InMemory / FileStream / multi-process), and after the source split the test reorganisation is mechanical churn rather than a clarity win. Happy to do it as a follow-up if you'd prefer.

Multi-GPU question replied to in #issuecomment-4350987016.

@cpcloud cpcloud linked an issue May 4, 2026 that may be closed by this pull request
Comment thread cuda_core/cuda/core/utils/__init__.py Outdated
Drop the module-level __getattr__/__dir__ shim that lazily exposed the
program-cache classes. Import them eagerly alongside the memoryview
helpers. Remove the two tests that pinned the lazy-import behaviour.

Also pick up ruff's auto-fixes (import-block blank lines, long-line
reformat) and rename the unused classmethod argument to satisfy ARG005.
Comment on lines +256 to +257
Program caches
--------------
Member

Should we merge this section with CUDA compilation toolchain above, or at least move it and make them stay closer?

Member

Alternatively, just merging with the above section is fine too.

        self[key] = value

    def close(self) -> None:  # noqa: B027
        """Release backend resources. No-op by default."""
Member

This is no-op only for trivial backends. If users don't use it as a context manager (well... it's a separate discussion why they should..), then they better call .close() for consistency and portability (to another backend).

    obj = program.compile("cubin")
    cache[key] = obj  # extracts bytes(obj.code)
else:
    obj = ObjectCode._init(data, "cubin")
Member

ObjectCode has many public constructors. We must not teach about any private methods.

Comment on lines +25 to +26
Intentionally does NOT subclass ``ProgramCacheResource`` -- the wrapper
should be duck-typed, so we test the duck-typed surface directly.
Member

Why don't we require each cache= instance to subclass from ProgramCacheResource?


cpdef bint _can_load_generated_ptx() except? -1:
"""Check if the driver can load PTX generated by the current NVRTC version."""
def _can_load_generated_ptx():

cpdef functions are also accessible from Python; is this change needed?


@leofang leofang left a comment


Review of the persistent program cache implementation. Findings are categorized inline as:

  • Critical (1): Must fix before merging — cache-write failure drops a successfully compiled ObjectCode.
  • Consideration (8): Performance/functionality concerns worth discussing; can be deferred but should be tracked.
  • Nitpick (6): Not blockers for merging.

This excludes items already captured in Leo's and rwgk's earlier review comments (platformdirs removal, over-eviction race, source-directory include guard, ObjectCode._init in docstrings, cpdef-to-def change, duck-typed test, close() vs context manager, multi-GPU usage, star-import laziness, doc section placement, SQLiteProgramCache removal).

Overall this is a well-engineered piece of work — the TOCTOU handling, stat-guards, and atomic-write design are thorough and well-documented. The main concern is the cache-write failure path losing the compile result.

Comment thread cuda_core/cuda/core/_program.pyx
Comment on lines +162 to +180
# ``self._code`` is always stored as bytes (see ``Program_init``),
# but ``make_program_cache_key`` only accepts bytes when
# ``code_type == "nvvm"`` -- c++/ptx must be ``str``. Decode back
# to the original str for the NVRTC/linker paths so the generated
# key matches keys callers build by passing the str source
# directly.
code_for_key = self._code if self._code_type == "nvvm" else self._code.decode("utf-8")

key = make_program_cache_key(
code=code_for_key,
code_type=self._code_type,
options=self._options,
target_type=target_type,
)
hit_bytes = cache.get(key)
if hit_bytes is not None:
# The uncached NVRTC path warns when the active driver can't
# load freshly-generated PTX; that loadability is a property
# of the driver, not of how the bytes were produced, so the

Consideration: logs silently receives nothing on cache hit.

When the cache hits, no compilation occurs, so logs.write() is never called. A caller relying on logs to confirm compilation ran (or to capture PTX output) will get silence with no indication it was a cache hit vs. a compile that produced no output. Worth a note in the cache= or logs parameter docstring, e.g.:

On a cache hit, no compilation is performed and logs receives no output.

Comment on lines +494 to +515
name for name in _DRIVER_LINKER_UNSUPPORTED_FIELDS if getattr(options, name, None) is not None
]
if unsupported:
raise ValueError(
f"the cuLink driver linker does not support these options: "
f"{', '.join(unsupported)}; Program.compile() would reject this "
f"configuration before producing an ObjectCode."
)

def option_fingerprint(self, options, target_type): # noqa: ARG002
# For PTX inputs the Linker reads only a subset of ProgramOptions
# (see ``_translate_program_options`` in _program.pyx); fingerprint
# just those fields so shared ProgramOptions carrying NVRTC-only
# flags (``include_path``, ``pch_*``, ``frandom_seed``, ...) don't
# force spurious cache misses on PTX.
return _linker_option_fingerprint(options, use_driver_linker=self._decide_driver())

def hash_version_probe(self, update):
# Only cuLink (driver-backed linker) goes through the CUDA driver
# for codegen. nvJitLink is a separate library, so a driver
# upgrade under it does not change the compiled bytes -- skip the
# driver version there. ``_linker_backend_and_version`` already

Consideration: _decide_driver() is called 2–3 times per key derivation for PTX inputs.

validate() calls self._decide_driver() at line 494, option_fingerprint() calls it again at line 515, and hash_version_probe() calls _linker_backend_and_version() which re-probes independently. Each call re-imports and re-invokes _decide_nvjitlink_or_driver().

If the result were to differ between calls within the same invocation (transient import failure recovering), the key would be internally inconsistent — validation used one backend, fingerprint used another. Even though this is unlikely in practice, computing the result once and threading it through (or caching it on the instance for the duration of one make_program_cache_key call) would be cleaner and more robust.
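The compute-once suggestion can be sketched like this. Every name here (`decide_driver`, `LinkerKeyBackend`) is a hypothetical stand-in for the PR's internals; the point is only that all consumers read one memoized decision, so validation and fingerprinting cannot disagree within a single key derivation.

```python
calls = {"n": 0}


def decide_driver():
    # Stand-in for _decide_nvjitlink_or_driver(): pretend it is costly
    # and could, in pathological cases, drift between invocations.
    calls["n"] += 1
    return False


class LinkerKeyBackend:
    def __init__(self):
        # One instance per make_program_cache_key call; the decision is
        # resolved lazily, at most once, and then reused.
        self._use_driver = None

    def _decide_driver_once(self):
        if self._use_driver is None:
            self._use_driver = decide_driver()
        return self._use_driver

    def validate(self, options):
        if self._decide_driver_once():
            pass  # driver-linker-specific option checks would go here

    def option_fingerprint(self, options):
        # Guaranteed to agree with the decision validate() used.
        return ("driver" if self._decide_driver_once() else "nvjitlink",)


backend = LinkerKeyBackend()
backend.validate(None)
assert backend.option_fingerprint(None) == ("nvjitlink",)
assert calls["n"] == 1  # probed exactly once despite two consumers
```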

if target_type not in _VALID_TARGET_TYPES:
raise ValueError(f"target_type={target_type!r} is not supported (must be one of {sorted(_VALID_TARGET_TYPES)})")
supported_for_code = _SUPPORTED_TARGETS_BY_CODE_TYPE[code_type]
if target_type not in supported_for_code:

Consideration: target_type is not case-normalized, asymmetric with code_type.

code_type is lowered here to ensure "PTX" and "ptx" route the same way. But target_type is not normalized — make_program_cache_key(target_type="PTX") and make_program_cache_key(target_type="ptx") produce different keys for the same compilation. The comment on lines 728–730 says "a caller that passes 'PTX' or 'C++' must get the same routing" but only addresses code_type.

Program.compile may normalize target_type internally, but standalone callers of make_program_cache_key are exposed to this instability. Consider adding target_type = target_type.lower() here to match.
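The instability is easy to demonstrate with a toy key function. `make_key` below is a stand-in for `make_program_cache_key`, not the real function; it just shows that without lowering, the same logical request hashes to two different keys.

```python
import hashlib


def make_key(code, target_type, normalize=False):
    # Toy stand-in: hash source plus target type into a key.
    if normalize:
        target_type = target_type.lower()
    h = hashlib.sha256()
    h.update(code.encode())
    h.update(target_type.encode())
    return h.hexdigest()


src = "__global__ void k() {}"
# Without normalization: same compilation, two keys -> spurious miss.
assert make_key(src, "PTX") != make_key(src, "ptx")
# With target_type.lower(): the keys coincide.
assert make_key(src, "PTX", normalize=True) == make_key(src, "ptx", normalize=True)
```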

Comment on lines +116 to +131
"debug": _gate_truthy,
"lineinfo": _gate_truthy,
"ftz": _gate_tristate_bool,
"prec_div": _gate_tristate_bool,
"prec_sqrt": _gate_tristate_bool,
"fma": _gate_tristate_bool,
"split_compile": _gate_identity,
"ptxas_options": _gate_ptxas_options,
"no_cache": _gate_is_true,
}


# LinkerOptions fields the ``cuLink`` driver backend silently ignores
# (emits only a DeprecationWarning; no actual flag reaches the compiler).
# When the driver backend is active, collapse them to a single sentinel in
# the fingerprint so nvJitLink<->driver parity of ``ObjectCode`` doesn't

Consideration: _LINKER_RELEVANT_FIELDS and _LINKER_FIELD_GATES must be kept in sync manually.

A field added to _LINKER_FIELD_GATES but forgotten in _LINKER_RELEVANT_FIELDS would be silently ignored — the gate exists but is never iterated over in _linker_option_fingerprint. The reverse (missing from _LINKER_FIELD_GATES) would KeyError at runtime, which is fail-fast but late.

A test assertion like assert set(_LINKER_RELEVANT_FIELDS) == set(_LINKER_FIELD_GATES.keys()) would close both directions cheaply. Same applies for sync with _translate_program_options in _program.pyx — the existing test_make_program_cache_key_supported_targets_matches_program_compile guards target drift but not field drift.

Comment on lines +491 to +494
def clear(self) -> None:
# Snapshot stat alongside path so we can refuse to unlink an entry
# that was concurrently replaced by another process between the
# snapshot scan and the unlink. Same stat-guard contract as

Nitpick: __len__ double-stats every entry.

_iter_entry_paths already filters with entry.is_file() (line 531), so every yielded path is known to be a file. The path.is_file() check here in __len__ is redundant -- it issues a second stat syscall per entry, so a cache with N entries pays 2N stat calls where N would suffice. Safe to remove the inner check:

def __len__(self) -> int:
    return sum(1 for _ in self._iter_entry_paths())

Comment on lines +391 to +392
self._root.mkdir(parents=True, exist_ok=True)
self._entries.mkdir(exist_ok=True)

Nitpick: max_size_bytes=0 is accepted but makes the cache a black hole.

Both InMemoryProgramCache and FileStreamProgramCache validate max_size_bytes >= 0, so max_size_bytes=0 passes. But a zero-byte cap means every write is immediately evicted (any payload has size > 0). This is almost certainly a user error. Consider either rejecting it with a ValueError, or documenting it explicitly as "effectively disables caching."
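The stricter option could look like the sketch below. `CacheBase` is a hypothetical stand-in, not the PR's class hierarchy; the point is just rejecting the zero cap at construction time.

```python
class CacheBase:
    """Illustrative base class validating the size cap up front."""

    def __init__(self, max_size_bytes=None):
        if max_size_bytes is not None and max_size_bytes <= 0:
            # A zero cap would make the cache a black hole: every write
            # is immediately evicted, so fail fast instead.
            raise ValueError(
                "max_size_bytes must be positive (or None for unbounded); "
                f"got {max_size_bytes!r}"
            )
        self._max_size_bytes = max_size_bytes


try:
    CacheBase(max_size_bytes=0)
except ValueError as exc:
    assert "positive" in str(exc)
else:
    raise AssertionError("expected ValueError for max_size_bytes=0")
```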

Comment on lines +100 to +113
return None
if isinstance(v, str):
return ("-Xptxas=" + v,)
if isinstance(v, collections.abc.Sequence):
if len(v) == 0:
return None
return tuple(f"-Xptxas={s}" for s in v)
return v


_LINKER_FIELD_GATES = {
"name": _gate_identity,
"arch": _gate_identity,
"max_register_count": _gate_identity,

Nitpick: ptxas_options fingerprint preserves list order.

["-v", "-O2"] and ["-O2", "-v"] produce different cache keys even if ptxas treats them equivalently. This causes spurious cache misses (not collisions), so it's safe but suboptimal. The conservative choice (preserve order) is defensible if ptxas flag order is significant for some flags. Worth a brief comment noting the choice is intentional.

)
from cuda.core._utils.cuda_utils import (
handle_return as _handle_return,
)

Nitpick: Top-level import creates a fragile circular-import dependency.

_keys.py imports ProgramOptions from cuda.core._program at module level, while _program.pyx does a deferred import of make_program_cache_key inside compile(). The deferred import breaks the cycle today, but if someone later adds from cuda.core.utils import ... at the top of _program.pyx, it would break. Worth a # NOTE: ... comment here documenting the mutual dependency and why _keys.py gets the top-level import while _program.pyx defers.

if extra_digest is not None:
_update("extra_digest", bytes(extra_digest))

return hasher.digest()

Nitpick: options.name is double-hashed for the linker (PTX) path.

For PTX inputs, "name" is in _LINKER_RELEVANT_FIELDS and goes through _linker_option_fingerprint via _gate_identity. Then this universal block hashes it again under the "options_name" label. Not a collision or instability bug (hashing the same value twice under different labels is safe), but it's redundant work for the linker path. Worth a comment noting the intentional redundancy, or dedup by skipping the linker path here.


@leofang leofang left a comment


Follow-up batch — 6 additional inline comments from the first review round:

  • Consideration (3): temp file burst-write thrashing, O(n) _enforce_size_cap on every write, UTF-8 decode introducing a new failure mode in the cache path.
  • Nitpick (3): stat-key triple dedup, Windows sharing-retry dedup, _KeyBackend class hierarchy vs simpler function dispatch.

Comment on lines +569 to +577
if self._tmp.exists():
for tmp in self._tmp.iterdir():
if not tmp.is_file():
continue
try:
total += tmp.stat().st_size
except FileNotFoundError:
continue
if total <= self._max_size_bytes:

Consideration: Temp file accounting under burst writes can cause committed-entry thrashing.

_enforce_size_cap counts temp files toward total but can only evict committed entries (the entries list). During high-concurrency writes, many young temp files (each under the 1h stale-sweep threshold) could push total persistently above the cap with no way to recover — the loop would keep evicting committed entries while the in-flight temps hold the total above max_size_bytes. Once the burst subsides and the temps are committed/cleaned, the evicted entries are gone.

This is a soft-cap design so it's consistent with the documented contract, but worth noting that burst write concurrency can cause disproportionate eviction of committed entries.

with contextlib.suppress(FileNotFoundError):
tmp_path.unlink()
raise
self._enforce_size_cap()

Consideration: _enforce_size_cap is O(n) on every __setitem__.

Every write stats all files in entries/ plus tmp/ to compute the total. For a large cache (thousands of entries), this could be measurably costly on every compile. An incremental size tracker (add on write, subtract on eviction, periodic reconciliation to correct drift from external deletions) would make writes O(1) in the common case, falling back to a full scan only when reconciliation is needed.
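A rough sketch of the incremental-accounting idea, under assumed names (`SizeTracker` and its methods are hypothetical, not the PR's code): keep a running byte total, adjust it on write and eviction, and rescan the directory only on the first use or every N writes to correct drift from external deletions.

```python
import os
import pathlib
import tempfile


class SizeTracker:
    """Incremental size accounting with periodic reconciliation."""

    def __init__(self, root, reconcile_every=256):
        self._root = root
        self._total = None  # unknown until first rescan
        self._writes = 0
        self._reconcile_every = reconcile_every

    def _rescan(self):
        # Full O(n) scan; only hit on first use or every N writes.
        self._total = sum(
            e.stat().st_size for e in os.scandir(self._root) if e.is_file()
        )

    def on_write(self, nbytes):
        self._writes += 1
        if self._total is None or self._writes % self._reconcile_every == 0:
            self._rescan()  # periodic correction for external deletions
        else:
            self._total += nbytes  # O(1) common case
        return self._total

    def on_evict(self, nbytes):
        if self._total is not None:
            self._total -= nbytes


root = tempfile.mkdtemp()
pathlib.Path(root, "a.bin").write_bytes(b"x" * 10)
tracker = SizeTracker(root)
assert tracker.on_write(10) == 10   # first write triggers a full rescan
pathlib.Path(root, "b.bin").write_bytes(b"y" * 5)
assert tracker.on_write(5) == 15    # subsequent writes are O(1)
tracker.on_evict(5)
```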

# to the original str for the NVRTC/linker paths so the generated
# key matches keys callers build by passing the str source
# directly.
code_for_key = self._code if self._code_type == "nvvm" else self._code.decode("utf-8")

Consideration: UTF-8 decode introduces a new failure mode in the cache path.

If the source code contains non-UTF-8 bytes (e.g. Latin-1 encoded comments), this .decode("utf-8") raises UnicodeDecodeError even though the uncached compile path would succeed (NVRTC accepts raw bytes). The cache path introduces a failure that doesn't exist without cache=.

This is likely rare in practice (CUDA source is almost always ASCII/UTF-8), but worth documenting in the cache= parameter docstring or catching with a clear error message pointing to the limitation.
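The clearer-error option could be wired roughly as below. `code_for_cache_key` is a hypothetical helper, not the PR's code; it wraps the decode so users learn about the limitation instead of getting a bare UnicodeDecodeError from deep inside compile().

```python
def code_for_cache_key(code_bytes, code_type):
    """Return the value to hash into the cache key (sketch)."""
    if code_type == "nvvm":
        return code_bytes  # nvvm keys are built from raw bytes
    try:
        return code_bytes.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(
            "cache= requires the program source to be valid UTF-8; "
            "non-UTF-8 sources can still be compiled without cache="
        ) from exc


assert code_for_cache_key(b"__global__ void k() {}", "c++") == "__global__ void k() {}"
try:
    # 0xe9 is a lone Latin-1 byte, invalid as UTF-8.
    code_for_cache_key(b"// caf\xe9 comment", "c++")
except ValueError as exc:
    assert "UTF-8" in str(exc)
else:
    raise AssertionError("expected ValueError")
```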

Comment on lines +190 to +195
if (st_now.st_ino, st_now.st_size, st_now.st_mtime_ns) != (
st_before.st_ino,
st_before.st_size,
st_before.st_mtime_ns,
):
return

Nitpick: Stat-key triple (st_ino, st_size, st_mtime_ns) is repeated in 4 places.

This exact pattern appears in _touch_atime (twice — fd-based and path-based), _prune_if_stat_unchanged, and _enforce_size_cap. Extracting a small helper would reduce duplication and make the invariant easier to audit:

def _stat_key(st: os.stat_result) -> tuple:
    return (st.st_ino, st.st_size, st.st_mtime_ns)

Then each site becomes if _stat_key(st_now) != _stat_key(st_before): return.

Comment on lines +84 to +97
for i, delay in enumerate(_REPLACE_RETRY_DELAYS):
if delay:
time.sleep(delay)
try:
os.replace(tmp_path, target)
return True
except PermissionError as exc:
if not _IS_WINDOWS or getattr(exc, "winerror", None) not in _SHARING_VIOLATION_WINERRORS:
raise
# Windows sharing violation; loop and try again unless this was the
# last attempt, in which case fall through and return False.
if i == len(_REPLACE_RETRY_DELAYS) - 1:
return False
return False

Nitpick: Windows sharing-violation retry pattern is repeated across 3 functions.

_replace_with_sharing_retry, _stat_and_read_with_sharing_retry, and _unlink_with_sharing_retry all iterate over _REPLACE_RETRY_DELAYS, sleep, try an operation, catch PermissionError, and check _is_windows_sharing_violation. A generic retry helper parameterized on the operation could deduplicate this:

def _with_sharing_retry(op, *args, on_exhausted=None, **kwargs):
    last_exc = None
    for delay in _REPLACE_RETRY_DELAYS:
        if delay:
            time.sleep(delay)
        try:
            return op(*args, **kwargs)
        except PermissionError as exc:
            if not _is_windows_sharing_violation(exc):
                raise
            last_exc = exc
    if on_exhausted is not None:
        return on_exhausted(last_exc)
    raise last_exc

Not critical — the current code is clear and correct — but it's 3× the same loop structure.

update(f"{label}_probe_failed", type(exc).__name__.encode())


class _KeyBackend(abc.ABC):

Nitpick: _KeyBackend class hierarchy vs plain function dispatch.

Three stateless singleton classes (_NvrtcBackend, _LinkerBackend, _NvvmBackend) for what is essentially a 3-way dispatch on code_type. The ABC does document the contract clearly (6 abstract/overridable methods), but three plain functions (or a dict of named tuples of functions) dispatched from make_program_cache_key would be simpler and equally extensible — there will likely never be a 4th code_type. The class hierarchy adds indirection and cognitive overhead (readers must trace through the ABC to understand the dispatch) without a clear extensibility payoff.

Not a correctness issue — the current design works — just noting the complexity cost relative to a simpler alternative.
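The flatter alternative might look like this sketch: one namedtuple of functions per code_type, dispatched from a dict. All names are illustrative, not the PR's code.

```python
from collections import namedtuple

# Each backend is just a bundle of plain functions -- no ABC needed.
KeyBackend = namedtuple("KeyBackend", ["validate", "option_fingerprint"])


def _nvrtc_fingerprint(options):
    return ("nvrtc", options)


def _linker_fingerprint(options):
    return ("linker", options)


def _nvvm_fingerprint(options):
    return ("nvvm", options)


_BACKENDS = {
    "c++": KeyBackend(validate=lambda o: None, option_fingerprint=_nvrtc_fingerprint),
    "ptx": KeyBackend(validate=lambda o: None, option_fingerprint=_linker_fingerprint),
    "nvvm": KeyBackend(validate=lambda o: None, option_fingerprint=_nvvm_fingerprint),
}


def make_key_fingerprint(code_type, options):
    # Three-way dispatch on code_type; the contract is visible in the
    # namedtuple fields rather than an abstract class.
    backend = _BACKENDS[code_type.lower()]
    backend.validate(options)
    return backend.option_fingerprint(options)


assert make_key_fingerprint("PTX", None) == ("linker", None)
```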


Labels

cuda.core: Everything related to the cuda.core module
feature: New feature or request
P0: High priority - Must do!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement compiler caches

4 participants